Towards Language-Guided Visual Recognition via Dynamic Convolutions

Abstract

In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel convolution module called Language-guided Dynamic Convolution (LaConv). Its kernels are dynamically generated based on natural language information, which helps extract differentiated features for different examples. Based on the LaConv module, we further build a fully language-driven network, termed LaConvNet, which unifies recognition and reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets of three vision-and-language tasks, i.e., visual question answering, referring expression comprehension, and segmentation. The experimental results not only show competitive or better performance of LaConvNet against existing networks, but also demonstrate the merits of a unified structure, including compactness, low computational cost, and high generalization ability. Our source code is released in the SimREC project: https://github.com/luogen1996/LaConvNet .
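
The idea of generating convolution kernels from language can be illustrated with a minimal sketch. This is not the authors' LaConv implementation: it assumes a single-channel image, a sentence already encoded as a small vector, and a plain linear projection that maps that vector to a k x k kernel. The names `generate_kernel`, `conv2d_same`, `language_guided_conv`, and the `weight` matrix are hypothetical and chosen only for illustration.

```python
def generate_kernel(text_vec, weight, k):
    """Project a sentence vector to a k x k kernel (hypothetical linear map).

    weight has k*k rows, each as long as text_vec.
    """
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in weight]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def language_guided_conv(image, text_vec, weight, k=3):
    """Filter the image with a kernel that depends on the text vector."""
    kernel = generate_kernel(text_vec, weight, k)
    return conv2d_same(image, kernel)


# Toy usage: the same image is filtered differently for two text vectors.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weight = [[0.0, 0.0] for _ in range(9)]
weight[4] = [1.0, 0.0]  # first text dimension drives the kernel centre
weight[5] = [0.0, 1.0]  # second text dimension drives the right neighbour
out_a = language_guided_conv(image, [1.0, 0.0], weight)
out_b = language_guided_conv(image, [0.0, 1.0], weight)
```

In this toy setup the two text vectors produce different kernels, so the same image yields different feature maps, which is the property the abstract attributes to language-generated kernels. A real model would learn the projection and operate on multi-channel feature maps.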

Related resources

Language Guided Visual Perception

People typically learn through exposure to visual stimuli associated with linguistic descriptions. For instance, teaching visual concepts to children is often accompanied by descriptions in text or speech. This motivates the question of how this learning process could be computationally modeled. In this dissertation we explored three settings, where we showed that combining language and vision ...

Visual Sign Language Recognition

We have developed the Hand Motion Understanding (HMU) system that understands static and dynamic signs of the Australian Sign Language (Auslan). The HMU system uses a visual 3D hand tracker for motion sensing, and an adaptive fuzzy expert system for classification of the signs. This paper presents the hand tracker that extracts 3D hand configuration data with 21 degrees-of-freedom (DOFs) from a...

Natural Language Guided Visual Relationship Detection

Reasoning about the relationships between object pairs in images is a crucial task for holistic scene understanding. Most existing works treat this task as a pure visual classification task: each type of relationship or phrase is classified as a relation category based on the extracted visual features. However, each kind of relationship has a wide variety of object combinations and each ...

Iterative Visual Reasoning Beyond Convolutions

We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems that lack the capability to reason beyond a stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory [4] to store previous beliefs with parallel updates; and a global graph-reasoning module. Our graph module has three components: a) a...

Language Recognition via Sparse Coding

Spoken language recognition requires a series of signal processing steps and learning algorithms to model distinguishing characteristics of different languages. In this paper, we present a sparse discriminative feature learning framework for language recognition. We use sparse coding, an unsupervised method, to compute efficient representations for spectral features from a speech utterance whil...

Journal

Journal title: International Journal of Computer Vision

Year: 2023

ISSN: 0920-5691, 1573-1405

DOI: https://doi.org/10.1007/s11263-023-01871-1